
Add metal probe disk cleaning #888

Open
stefanhipfel wants to merge 3 commits into main from worktree-diskcleaning

Conversation

@stefanhipfel
Contributor

@stefanhipfel stefanhipfel commented May 13, 2026

Summary

Adds disk cleaning inside the metal probe image

Features

  • Smart erasure: blkdiscard for SSDs, shred/dd for HDDs
  • Concurrent: up to 4 disks in parallel, with per-disk timeouts (10 min quick / 24 h secure)
  • Persistent state: marker file prevents re-cleaning on restart
  • Security: path validation, command injection prevention
  • Ordering: runs before the agent sends its discovery report, so the Server does not become available while disk cleaning is still in progress

Testing

  • Tested on Linux VM with "real disks" (via Lima on Mac)

Fixes #704

Summary by CodeRabbit

Release Notes

  • New Features
    • Added disk cleaning capability to the probe agent with two configurable operation modes: quick (standard erasure) and secure (enhanced secure erase).
    • Introduced CLI configuration options to enable disk cleaning and select the preferred cleaning mode.
    • Disk cleaning process executes once automatically following successful server registration; it is implemented on Linux, with a stub on macOS that reports cleaning as Linux-only.

Review Change Stack

@coderabbitai
Contributor

coderabbitai Bot commented May 13, 2026

Warning

Rate limit exceeded

@stefanhipfel has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 39 minutes and 22 seconds before requesting another review.


⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1a0ae117-963e-4910-af17-23b8604d63cd

📥 Commits

Reviewing files that changed from the base of the PR and between c5647b2 and a966eb3.

📒 Files selected for processing (10)
  • cmd/metalprobe/main.go
  • config/manager/manager.yaml
  • internal/controller/conditions.go
  • internal/controller/server_controller.go
  • internal/controller/server_controller_test.go
  • internal/controller/suite_test.go
  • internal/probe/diskcleaning_darwin.go
  • internal/probe/diskcleaning_linux.go
  • internal/probe/probe.go
  • internal/probe/probe_suite_test.go
📝 Walkthrough

Walkthrough

This PR implements disk-cleaning support in the probe agent with platform-specific logic (Linux with quick/secure modes, Darwin stub), CLI configuration in metalprobe, integration into the agent's post-registration lifecycle, and corresponding test and configuration updates.

Changes

Probe Agent Disk Cleaning

Layer / File(s) Summary
Agent Structure and Lifecycle Integration
internal/probe/probe.go
Agent struct gains CleanDisks and DiskCleaningMode fields; NewAgent constructor updated to accept disk-cleaning parameters; new PerformDiskCleaning(ctx) method executes cleaning once after initial registration; Start invokes it without interrupting the LLDP refresh loop if cleaning fails.
metalprobe CLI Configuration and Validation
cmd/metalprobe/main.go
Adds --clean-disks (boolean) and --disk-cleaning-mode (string, default "quick") CLI flags; validates mode is "quick" or "secure" at startup and exits with error for invalid values; passes configuration to probe.NewAgent(...).
Linux Disk Cleaning: Core Logic and Strategies
internal/probe/diskcleaning_linux.go
Implements cleanDisks with completion-marker persistence, device enumeration via ghw.Block(), and bounded-concurrency pool. quickCleanDisk (10-min timeout) wipes first 10MiB and optionally last 10MiB; secureCleanDisk (24-hour timeout) detects rotational/SSD via sysfs and uses blkdiscard --secure (SSD), shred, or dd fallback; aggregates per-disk results and returns aggregate error if any disk fails.
Linux Disk Cleaning: Helper Functions
internal/probe/diskcleaning_linux.go
Provides utilities: wipeDiskRangeNative (zero-fill via Go file ops with context cancellation), isRotational/isReadOnly (sysfs reads), getDiskSize (parses blockdev --getsize64), executeBlkDiscard (invokes blkdiscard with optional --secure), rereadPartitionTable (executes blockdev --rereadpt).
Darwin Platform Stub
internal/probe/diskcleaning_darwin.go
Platform-gated (//go:build darwin) stub that always returns error indicating disk cleaning is Linux-only.
Manager Configuration and Condition Constants
config/manager/manager.yaml, internal/controller/conditions.go
Extends manager Deployment to disable disk cleaning and set default mode to "quick"; adds ConditionDiskCleaningCompleted constant in controller.
Test Updates
internal/controller/server_controller_test.go, internal/probe/probe_suite_test.go, internal/controller/suite_test.go
Updates controller tests to pass new "quick" parameter to probe.NewAgent(...); updates probe test suite to pass false and "quick" to agent initialization; removes stale comment from test setup.
Minor Inline Adjustments
internal/controller/server_controller.go
Single-line edit in applyDefaultIgnitionForServer near probeFlags construction.
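The rotational/SSD detection mentioned for diskcleaning_linux.go is conventionally done by reading the sysfs queue attribute; a minimal sketch under that assumption (function names are illustrative, not necessarily the PR's):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// parseRotational interprets the contents of the sysfs "rotational" file:
// "1" means a spinning HDD, "0" means SSD/NVMe.
func parseRotational(data string) bool {
	return strings.TrimSpace(data) == "1"
}

// isRotational reads /sys/block/<disk>/queue/rotational for a bare disk
// name such as "sda" (no /dev prefix).
func isRotational(disk string) (bool, error) {
	data, err := os.ReadFile(filepath.Join("/sys/block", disk, "queue", "rotational"))
	if err != nil {
		return false, err
	}
	return parseRotational(string(data)), nil
}

func main() {
	if rot, err := isRotational("sda"); err == nil {
		fmt.Println("sda rotational:", rot)
	} else {
		fmt.Println("could not read sysfs:", err)
	}
}
```

This distinction is what lets the secure mode pick blkdiscard --secure for flash devices and shred/dd for rotational ones.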

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested labels

size/L

Suggested reviewers

  • afritzler
  • nagadeesh-nagaraja
🚥 Pre-merge checks | ✅ 1 | ❌ 4

❌ Failed checks (3 warnings, 1 inconclusive)

Check name Status Explanation Resolution
Linked Issues check ⚠️ Warning The PR implements disk cleaning in the metal-probe agent but does not include ServerCleaning controller tests/implementation that issue #704 specifically requests. Add ServerCleaning controller with comprehensive unit tests covering creation, reconciliation, state transitions, and ensure ServerCleaningReconciler is properly initialized in the test suite.
Out of Scope Changes check ⚠️ Warning The PR includes disk cleaning implementation at the probe level as discussed, but the linked issue #704 explicitly requires ServerCleaning controller implementation which is out of scope here. Either expand scope to include ServerCleaning controller per issue #704, or clarify whether this PR should be limited to probe-level disk cleaning with a follow-up PR for operator logic.
Docstring Coverage ⚠️ Warning Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ❓ Inconclusive The PR description covers key features and testing, but lacks detailed explanation of changes across multiple files and missing sections from the template. Expand the description to explain changes to the Agent struct, NewAgent signature updates, new disk-cleaning modes, and validation logic. Consider adding a 'Testing' subsection with specific test cases covered.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Add metal probe disk cleaning' directly and clearly describes the main change: adding disk cleaning functionality to the metal probe subsystem.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 8

🧹 Nitpick comments (1)
api/v1alpha1/server_types.go (1)

118-118: ⚡ Quick win

Use a fixed-width integer type for DisksProcessed.

The int type at line 118 lacks platform-independence in CRD schema generation. Use int32 (or int64) for explicit, stable API contracts as per Kubernetes conventions.

♻️ Proposed fix
-	DisksProcessed int `json:"disksProcessed,omitempty"`
+	DisksProcessed int32 `json:"disksProcessed,omitempty"`
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/v1alpha1/server_types.go` at line 118, The DisksProcessed field currently
uses the platform-dependent int type; change its type to a fixed-width integer
(e.g., int32) in the struct that defines DisksProcessed in
api/v1alpha1/server_types.go so the CRD schema is stable across platforms—update
the declaration from "DisksProcessed int `json:\"disksProcessed,omitempty\"`" to
"DisksProcessed int32 `json:\"disksProcessed,omitempty\"`" (or int64 if you
prefer 64-bit) and run codegen/CRD generation as needed to propagate the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/metalprobe/main.go`:
- Around line 40-41: Validate the diskCleaningMode flag immediately after
flag.Parse() in main() by checking that the string value of diskCleaningMode is
one of the allowed values ("quick" or "secure"); if not, print a clear error to
stderr or use log.Fatalf and exit with a non-zero status so the process fails
fast. Locate the diskCleaningMode variable and its usage around flag.Parse() and
replace the current silent acceptance with this explicit validation (and apply
the same validation if diskCleaningMode is set or re-read elsewhere).

In `@internal/controller/server_controller_test.go`:
- Around line 1313-1390: This test creates bmcSecret and server objects but
never deletes them; add explicit cleanup after creation (e.g., immediate defers
or explicit deletions before test exit) to remove the BMCSecret (bmcSecret) and
Server (server) resources via the k8sClient.Delete calls so they don't persist
across tests; locate the creation sites for bmcSecret and server in the It(...)
block and add cleanup logic (defer Expect(k8sClient.Delete(ctx,
bmcSecret)).To(Succeed()) and same for server) or call k8sClient.Delete with the
same objects/refs before returning, ensuring deletion is invoked even on test
failures.

In `@internal/controller/server_controller.go`:
- Around line 423-425: The call to r.handleDiskCleaningCompletion(ctx, server)
currently swallows errors (just logs them), so persistent updates like resetting
CleanOnce/status may never be retried; change the reconcile behavior to
propagate the error instead of ignoring it so the controller requeues the
request (or explicitly requeue with a backoff) when handleDiskCleaningCompletion
fails. Ensure handleDiskCleaningCompletion itself is idempotent (safe to run
multiple times) and performs the persistent update (reset CleanOnce / status
update) in a retryable way (e.g., update the Server via the client with
optimistic retry or patch), and then return any error up to the caller so
reconciler can retry.
- Around line 1303-1311: Change the hard-success wording to indicate an
attempt/completion that may have ignored non-fatal errors: update
server.Status.DiskCleaningStatus.Message from "Disk cleaning completed
successfully" to something like "Disk cleaning attempted during discovery" (or
"Disk cleaning completed; errors may have been ignored"), and when calling
r.Conditions.UpdateSlice for ConditionDiskCleaningCompleted replace the
UpdateReason("DiskCleaningSucceeded") and UpdateMessage("Disk cleaning completed
during discovery") with a non-absolute reason/message such as
UpdateReason("DiskCleaningAttempted") and UpdateMessage("Disk cleaning attempted
during discovery; non-fatal errors may have occurred") so the condition does not
assert an unequivocal success when failures can be ignored.
- Around line 1289-1302: After re-fetching the server object, guard against a
nil server.Spec.DiskCleaningPolicy before dereferencing its DiskCleaningMode:
check if server.Spec != nil and server.Spec.DiskCleaningPolicy != nil and only
then read server.Spec.DiskCleaningPolicy.DiskCleaningMode into
server.Status.DiskCleaningStatus.CleaningMode; otherwise set CleaningMode from
the reconciler default
(metalv1alpha1.DiskCleaningMode(r.DiskCleaningDefaultMode)) so you never
dereference a nil DiskCleaningPolicy when updating
server.Status.DiskCleaningStatus in the Server controller.
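The "retryable status update" the comments above ask for is usually done in controllers with client-go's retry.RetryOnConflict helper; the dependency-free sketch below mimics that pattern (the conflict error and attempt counts are illustrative, not the PR's code):

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the Kubernetes 409 Conflict returned when the
// object's resourceVersion is stale.
var errConflict = errors.New("conflict")

// retryOnConflict re-runs fn while it returns a conflict error, up to
// maxAttempts, mirroring client-go's retry.RetryOnConflict. fn must be
// idempotent: each attempt should re-fetch and re-apply the update.
func retryOnConflict(maxAttempts int, fn func() error) error {
	var err error
	for i := 0; i < maxAttempts; i++ {
		err = fn()
		if err == nil || !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

func main() {
	attempts := 0
	err := retryOnConflict(5, func() error {
		attempts++
		if attempts < 3 {
			return errConflict // simulate a stale resourceVersion
		}
		return nil // status update finally succeeded
	})
	fmt.Println(err == nil, attempts)
}
```

Propagating the final error back to Reconcile, as the comment suggests, then lets controller-runtime requeue with backoff when even the retries fail.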

In `@internal/probe/diskcleaning_linux.go`:
- Around line 205-220: The code currently writes the completion marker even when
zero disks were actually cleaned; update the logic so
markDiskCleaningCompleted() is only called when at least one disk was
successfully cleaned and there are no failures. Concretely, after computing
successCount, results and skippedCount (symbols: successCount, results,
skippedCount, blockStorage.Disks, markDiskCleaningCompleted), first return an
error if successCount == 0 (e.g., "no disks cleaned" or treat as
skipped/failure), then keep the existing failure check (if successCount <
len(results) return fmt.Errorf(...)); only call markDiskCleaningCompleted() when
successCount > 0 and successCount == len(results).
- Around line 281-303: The code currently uses
exec.CommandContext(...).CombinedOutput() for long-running wiping commands
("shred" and "dd") which can OOM; change both usages to start the command and
stream stdout/stderr instead of buffering: create the *exec.Cmd via
exec.CommandContext (same invocations around shred and dd), call
cmd.StdoutPipe()/cmd.StderrPipe(), start cmd with cmd.Start(), and concurrently
io.Copy from the pipes into a bounded logger sink or io.Discard (or log lines
incrementally) and then wait with cmd.Wait()/cmd.Run() to get the exit status;
keep the existing error-handling logic (fall back from shred to dd and call
rereadPartitionTable on success) but remove CombinedOutput usage so output is
not fully buffered in memory.
- Around line 69-71: The current device validation in validateDevicePath
incorrectly accepts character devices; update the check so it only allows block
devices by verifying fi.Mode() has os.ModeDevice set and does NOT have
os.ModeCharDevice set (i.e., reject when os.ModeCharDevice is present). Modify
the conditional around validateDevicePath's os.ModeDevice check to explicitly
test and return an error if fi.Mode()&os.ModeCharDevice != 0, keeping the
existing error message for non-block devices.

---

Nitpick comments:
In `@api/v1alpha1/server_types.go`:
- Line 118: The DisksProcessed field currently uses the platform-dependent int
type; change its type to a fixed-width integer (e.g., int32) in the struct that
defines DisksProcessed in api/v1alpha1/server_types.go so the CRD schema is
stable across platforms—update the declaration from "DisksProcessed int
`json:\"disksProcessed,omitempty\"`" to "DisksProcessed int32
`json:\"disksProcessed,omitempty\"`" (or int64 if you prefer 64-bit) and run
codegen/CRD generation as needed to propagate the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5d5b7bd8-9842-46e8-93f0-e1666bccf357

📥 Commits

Reviewing files that changed from the base of the PR and between bf8e075 and 6bf1e57.

📒 Files selected for processing (19)
  • api/v1alpha1/applyconfiguration/api/v1alpha1/diskcleaningpolicy.go
  • api/v1alpha1/applyconfiguration/api/v1alpha1/diskcleaningstatus.go
  • api/v1alpha1/applyconfiguration/api/v1alpha1/serverspec.go
  • api/v1alpha1/applyconfiguration/api/v1alpha1/serverstatus.go
  • api/v1alpha1/applyconfiguration/utils.go
  • api/v1alpha1/server_types.go
  • api/v1alpha1/zz_generated.deepcopy.go
  • cmd/main.go
  • cmd/metalprobe/main.go
  • config/crd/bases/metal.ironcore.dev_servers.yaml
  • config/manager/manager.yaml
  • internal/controller/conditions.go
  • internal/controller/server_controller.go
  • internal/controller/server_controller_test.go
  • internal/controller/suite_test.go
  • internal/probe/diskcleaning_darwin.go
  • internal/probe/diskcleaning_linux.go
  • internal/probe/probe.go
  • internal/probe/probe_suite_test.go

Comment thread cmd/metalprobe/main.go
Comment thread internal/controller/server_controller_test.go Outdated
Comment thread internal/controller/server_controller.go Outdated
Comment thread internal/controller/server_controller.go Outdated
Comment thread internal/controller/server_controller.go Outdated
Comment thread internal/probe/diskcleaning_linux.go
Comment thread internal/probe/diskcleaning_linux.go
Comment thread internal/probe/diskcleaning_linux.go Outdated
Comment on lines +281 to +303
	cmd := exec.CommandContext(ctx, "shred", "-v", "-n", "1", "-z", "--force", devicePath)
	output, err := cmd.CombinedOutput()
	if err != nil {
		log.Error(err, "shred failed, falling back to dd", "disk", diskName, "output", string(output))
		// Fall through to dd instead of returning error
	} else {
		// shred succeeded
		if err := rereadPartitionTable(ctx, devicePath); err != nil {
			log.V(1).Info("Warning: failed to re-read partition table", "error", err)
		}
		return nil
	}
}

// Use dd as fallback (either shred not found or shred failed)
log.V(1).Info("Using dd for secure wipe", "disk", diskName)
cmd := exec.CommandContext(ctx, "dd",
	"if=/dev/urandom",
	"of="+devicePath,
	"bs=1M",
	"status=progress")
output, err := cmd.CombinedOutput()
if err != nil {
Contributor


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid CombinedOutput() for long-running wipe commands.

shred -v and dd status=progress can emit large continuous output; buffering all of it in memory risks OOM. Prefer Run() with bounded/streamed logging (or disable verbose progress).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/probe/diskcleaning_linux.go` around lines 281 - 303, The code
currently uses exec.CommandContext(...).CombinedOutput() for long-running wiping
commands ("shred" and "dd") which can OOM; change both usages to start the
command and stream stdout/stderr instead of buffering: create the *exec.Cmd via
exec.CommandContext (same invocations around shred and dd), call
cmd.StdoutPipe()/cmd.StderrPipe(), start cmd with cmd.Start(), and concurrently
io.Copy from the pipes into a bounded logger sink or io.Discard (or log lines
incrementally) and then wait with cmd.Wait()/cmd.Run() to get the exit status;
keep the existing error-handling logic (fall back from shred to dd and call
rereadPartitionTable on success) but remove CombinedOutput usage so output is
not fully buffered in memory.

@stefanhipfel stefanhipfel force-pushed the worktree-diskcleaning branch from 6bf1e57 to df1de82 Compare May 13, 2026 12:34
@afritzler
Member

Thanks @stefanhipfel for the PR. I commented in the referenced issue how we could simplify the cleanup phase: #613 (comment) I would suggest to take the discussion there.

Comment thread api/v1alpha1/server_types.go Outdated

// DiskCleaningMode defines the disk cleaning method.
// +kubebuilder:validation:Enum=quick;secure
type DiskCleaningMode string
Member


I would hide this behind a global flag for now. E.g. --enable-disk-sanitation or --disk-sanitization-policy (None by default, secure or fast).

return false, err
}

if err := r.handleDiskCleaningCompletion(ctx, server); err != nil {
Member


This task I would put into an own controller creating claims with tolerations for those cleanup tasks.

@stefanhipfel
Contributor Author

stefanhipfel commented May 15, 2026

Thanks @stefanhipfel for the PR. I commented in the referenced issue how we could simplify the cleanup phase: #613 (comment) I would suggest to take the discussion there.

@afritzler
Would it then make sense to implement only the disk cleaning in the metal-probe in this PR, and add a separate PR for the operator logic that triggers the cleaning?
FYI: I remember that we discussed doing the disk cleaning during discovery, to keep the quick solution very simple.

@github-actions github-actions Bot added size/L and removed size/XXL labels May 15, 2026
@stefanhipfel stefanhipfel changed the title Add disk cleaning feature for server discovery Add metal probe disk cleaning May 15, 2026
@stefanhipfel stefanhipfel force-pushed the worktree-diskcleaning branch from c5647b2 to d307f97 Compare May 15, 2026 13:53
This adds automated disk sanitization during server discovery to prepare
bare-metal servers for reuse by removing old data, partition tables, and
filesystem signatures.

API Changes:
- Add DiskCleaningPolicy to ServerSpec with Enabled, CleanOnce, and Mode fields
- Add DiskCleaningStatus to ServerStatus with completion tracking
- Add ConditionDiskCleaningCompleted condition type

Operator Configuration:
- Add --enable-disk-cleaning flag (default: false, opt-in)
- Add --disk-cleaning-mode flag (default: "quick")
- Three-level hierarchy: operator default → per-server → clean-once

Controller:
- Implement shouldEnableDiskCleaning() for three-level configuration resolution
- Implement getDiskCleaningMode() for mode selection
- Implement handleDiskCleaningCompletion() for CleanOnce auto-reset
- Pass disk cleaning flags to probe via ignition config

Probe Implementation:
- Move disk cleaning into Agent lifecycle (after registration)
- Concurrent disk cleaning with semaphore (max 4 disks in parallel)
- Two modes: quick (wipe headers/footers) and secure (full disk wipe)
- Safety checks: read-only detection, removable device filtering, block device validation
- Flash storage detection via sysfs rotational flag
- Secure erasure with blkdiscard for SSDs, shred/dd fallback for HDDs
- Per-disk timeouts: 10 minutes (quick), 24 hours (secure)
- Persistent marker file prevents re-cleaning on agent restart
- Security: regex validation, path sanitization, command injection prevention

Testing:
- Add comprehensive unit tests for configuration hierarchy
- Add parameterized tests for shouldEnableDiskCleaning (9 cases)
- Add parameterized tests for getDiskCleaningMode (5 cases)
- Add integration test for ignition flag generation
- All existing tests pass

Key Features:
- Cleans ALL disks including OS disk (probe runs from RAM/PXE)
- Non-fatal errors (discovery continues even if cleaning fails)
- Follows OpenStack Ironic pattern (clean all disks by default)
- Production-ready with security hardening and proper error handling

Signed-off-by: Stefan Hipfel <stefan.hipfel@sap.com>
Signed-off-by: Stefan Hipfel <stefan.hipfel@sap.com>
Remove all controller/operator integration for disk cleaning to keep this PR minimal. The metalprobe binary retains --clean-disks and --disk-cleaning-mode flags for manual invocation.

Signed-off-by: Stefan Hipfel <stefan.hipfel@sap.com>
@stefanhipfel stefanhipfel force-pushed the worktree-diskcleaning branch from d307f97 to a966eb3 Compare May 15, 2026 13:55